Analysis of Categorical DataIntroduction to categorical dataChi-Square TestsChi-square Goodness of Fit (GoF) testChi-square tests of independence
Many experiments result in measurements that are qualitative (categorical) rather than quantitative (numbers).
Example
We can represent each outcome as a length- vector such that if belongs to the th category and otherwise for .
Note the sample can be summarized by providing the counts of the number of measurements that fall into each of the categories, i.e., by the random vector such that .
Definition (multinomial distribution) Consider a random experiment such that
Then the counts for each category () follow a multinomial distribution with the number of trials and cell probabilities (), i.e. .
The pmf for is
where the cell probabilities sum up to , i.e.,
Remark 1: a Binomial distribution is a special case of the multinomial distribution when , i.e., .
Remark 2: Suppose are i.i.d., such that . Then .
Using observed counts (i.e., a realization of ), we would like to make inferences about the category probabilities .
Setting
We have a random sample where each . Equivalently, we have a random variable such that .
at least one differs from the hypothesized value , ,
where are pre-specified probabilities such that .
Null distribution:
Given a significance level , the rejection region for the observed test statistic : , where the observed test statistic .
The test of "rejecting if the observed test statistic exceeds " is an approximate sig. level test.
Remark 1: this is an asymptotic test. The sample size needs to be large to approximate the null distribution with the distribution. A rule of thumb is to check whether for . In practice, we check whether each observed count exceeds or not.
Remark 2: this test has a connection with the asymptotic LRT. In fact, it can be shown that when is large. In particular, the degrees of freedom of the distribution is the difference of the number of free parameters in the null and full parameter spaces.
Example: A group of rats, one by one, proceed down a ramp to one of three doors. We wish to test the hypothesis that the rats have no preference concerning the choice of a door. Suppose that the rats were sent down the ramp times and that the three observed cell frequencies were , and .
The test above can be used to test whether sample data are from a given distribution or not.
Idea: partition the range of into distinct regions () and compare observed and expected counts for each region.
Example Let denote the number of heads that occur when four identical coins are tossed at random. Under the assumption that the four coins are independent and the probability of heads on each coin is , is . One hundred repetitions of this experiment resulted in , and heads being observed on , and trials, respectively. Do these results support the assumptions? Make a conclusion at .
Often, we are interested in testing whether the random variable of interest is from a certain family of distributions or not. Note, in such case, the distribution of is not completely specified under .
where is a ML estimator of under (i.e., ).
Null distribution:
where is the length of (i.e., the number of free parameters under ).
Given a significance level , the rejection region for the observed test statistic : , where the observed test statistic .
The test of "rejecting if the observed test statistic exceeds " is an approximate sig. level test.
Example The number of accidents per week at an intersection was checked for weeks; Out of weeks, weeks had no accidents, weeks had one accident, and weeks had accidents. Test the hypothesis that the random variable has a Poisson distribution, assuming the observations to be independent. Use .
Two-way contingency table (a table for two categorical variables)
Example:
A statistics 415 instructor wants to know if there is a relationship between favorite color (red or yellow) and the preferred condiment on a corn dog. The following table summarizes the results.
Condiment | |||
---|---|---|---|
Color | Ketchup | Mustard | Total |
Red | |||
Yellow | |||
Total |
A general example of our contingency table with two classifying factors can be displayed as follows.
Total | |||||
---|---|---|---|---|---|
Total |
In some problems, the counts , , can be modeled using a multinomial distribution
Assume each of the observations results in an outcome that can be classified by two attributes
The first attribute is randomly assigned to one and only one of mutually exclusive and exhaustive events and the second attribute falls into one and only one the mutually exclusive events
Define
Chi-square tests of independence
Chi-square tests of independence aim to answer the question: for a single observation, is the row assignment statistically independent of the column assignment?
, ,
vs.
for at least 1 pair.
Defining
we can rewrite and as
, ,
for at least 1 pair
where
the observed value at cell :
Example:
Test whether there is a relationship between favorite color (red or yellow) and the preferred condiment on a corn dog at .
Condiment | |||
---|---|---|---|
Color | Ketchup | Mustard | Total |
Red | |||
Yellow | |||
Total |